Add rlformers forward-pass features to ExecuTorch backbone for on-device export parity (#19096)
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19096
Note: Links to docs will display an error until the docs builds have been completed.
❗ 1 Active SEV: there is 1 currently active SEV. If your PR is affected, please view it below.
✅ You can merge normally! (2 Unrelated Failures) As of commit 3d5f0d6 with merge base d9688da:
FLAKY - The following job failed but was likely due to flakiness present on trunk.
BROKEN TRUNK - The following job failed but was present on the merge base. 👉 Rebase onto the `viable/strict` branch to avoid these failures.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@ifed-ucsd has exported this pull request. If you are a Meta employee, you can view the originating Diff in D102030169.
Force-pushed from 7feccab to 9fbe028
Force-pushed from 9fbe028 to 8b2da23
Force-pushed from 8b2da23 to 62852da
Force-pushed from 62852da to 6d2a84a
digantdesai left a comment:
Review automatically exported from Phabricator review in Meta.
Force-pushed from e686bed to 18e2b65
Force-pushed from 5c8ace1 to 85c4f8c
Force-pushed from 85c4f8c to 3d5f0d6
Summary:
The 730M dense model checkpoint uses several features that the ExecuTorch XNNPACK export path did not implement. Without these, the exported model produces numerically incorrect output.
This diff adds support for 8 missing features:
1. `normalize_tok_embeddings` — scaleless RMSNorm after embedding lookup
2. `qk_norm_before_rope` — conversion from GenAI args (attention code already supported it)
3. `scale_query_by` — custom scalar multiplier on Q after QK norm
4. `use_attn_o_gate` — sigmoid gate on attention output using a learned linear projection of the layer input
5. `use_attn_o_norm` — scaleless per-head RMSNorm on attention output (applied before o_gate)
6. `use_residual_gate` — NormPreservingResidualConnection with learned per-dim gates for both attention and FFN residual connections
7. `use_ffn_learnable_scales` — RMSNormWithInputScale replacing standard post-FFN norm, computing `rms_norm(gamma * x)` instead of `gamma * rms_norm(x)`
8. `output_soft_cap_temp` — `tanh(logits/temp) * temp` soft capping on output logits (see the sketch after this list)
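For orientation, here is a minimal PyTorch sketch of items 1/5, 4, 7, and 8 above. Function names, signatures, and the `eps` value are illustrative; this is not the actual ExecuTorch implementation.

```python
import torch

def scaleless_rms_norm(x: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Items 1 and 5: RMSNorm without a learned scale, applied over the last
    # dimension (embedding dim, or head dim for the per-head variant).
    return x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)

def attn_o_gate(attn_out: torch.Tensor, layer_input: torch.Tensor,
                og_weight: torch.Tensor) -> torch.Tensor:
    # Item 4: sigmoid gate on the attention output, computed from a learned
    # linear projection (og_weight) of the layer input.
    gate = torch.sigmoid(layer_input @ og_weight.t())
    return attn_out * gate

def rms_norm_with_input_scale(x: torch.Tensor, gamma: torch.Tensor,
                              eps: float = 1e-6) -> torch.Tensor:
    # Item 7: scale the input first, then normalize, i.e. rms_norm(gamma * x)
    # rather than the usual gamma * rms_norm(x).
    return scaleless_rms_norm(gamma * x, eps)

def soft_cap_logits(logits: torch.Tensor, temp: float) -> torch.Tensor:
    # Item 8: tanh soft cap, bounding the logits to (-temp, temp).
    return torch.tanh(logits / temp) * temp
```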
Additionally, this diff fixes a QK norm checkpoint compatibility issue: some checkpoints contain learned QK norm weights even though their params.json has `qk_norm_affine=False` (due to default changes after training). The ET model was creating `ScalelessRMSNorm` (no weight parameter) based on params.json, silently discarding the checkpoint's trained QK norm weights. The rlformers reference model loaded them correctly, causing ~53-67 dB SNR divergence. The fix peeks at the checkpoint state dict before model construction — if QK norm weights are present, `qk_norm_affine` is overridden to `True` so the ET model creates affine QK norms that load those weights.

All features are off by default (backward compatible). They activate when the corresponding fields are set in the checkpoint's params.json and propagated through model_args_conversion.
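A minimal sketch of the checkpoint-peek override described above (not the actual export code); the `.q_norm.weight` / `.k_norm.weight` key suffixes are assumptions and may differ from the real checkpoint layout:

```python
def maybe_force_affine_qk_norm(model_args, checkpoint_state_dict):
    # If the checkpoint carries trained QK norm weights, override
    # qk_norm_affine=False from params.json so the exported model creates
    # affine QK norms and actually loads those weights.
    # NOTE: the key suffixes below are assumed for illustration.
    has_qk_weights = any(
        key.endswith((".q_norm.weight", ".k_norm.weight"))
        for key in checkpoint_state_dict
    )
    if has_qk_weights and not model_args.qk_norm_affine:
        model_args.qk_norm_affine = True
    return model_args
```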
Weight key mappings added for: `attention.og.weight`, `add_attn.gate`, `add_ffn.gate`, `post_ffn_norm.weight`.
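Conceptually, these new entries slot into the checkpoint-key remapping done during export. The sketch below is illustrative only; the ExecuTorch-side target names on the right are assumptions, not the real parameter names.

```python
# Illustrative remapping of the new rlformers checkpoint keys. The right-hand
# target names are hypothetical; the real mapping lives in the export path's
# weight-conversion code.
NEW_KEY_MAP = {
    "attention.og.weight": "attention.o_gate.weight",  # o_gate projection
    "add_attn.gate": "attention_residual.gate",        # attention residual gate
    "add_ffn.gate": "ffn_residual.gate",               # FFN residual gate
    "post_ffn_norm.weight": "post_ffn_norm.weight",    # learnable FFN scales
}

def remap_keys(state_dict):
    # Rewrite any key that ends with a mapped suffix, keeping layer prefixes
    # (e.g. "layers.0.") intact.
    remapped = {}
    for key, value in state_dict.items():
        for old, new in NEW_KEY_MAP.items():
            if key.endswith(old):
                key = key[: -len(old)] + new
                break
        remapped[key] = value
    return remapped
```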
Reviewed By: chinnadhurai, digantdesai

Differential Revision: D102030169